This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
The purpose of this notebook is to provide some examples and problems for students to explore the use of R in analyzing data. For this set of problems, we use the built-in dataset called ‘iris’, based on Ronald Fisher’s pioneering 1936 work on statistics in biology. It is a multivariate data set introduced in his paper, “The use of multiple measurements in taxonomic problems.”
Because it is included in the R distribution, you can get immediate help by entering:
?iris
The ? operator is a general-purpose tool for command-line help.
Here is an illustration of the iris attributes.
If you want to explore other included datasets, type
library(help = "datasets")
This will give you a list of all datasets included in the datasets package for R.
Nice! We have inline help that describes our dataset. This is available for all included datasets.
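As a small sketch of the same idea from code, the listing printed by library(help = "datasets") can also be retrieved as a data structure with data(package = "datasets"). The variable name dataset_index is only illustrative.

```r
# Retrieve the index of datasets shipped in the 'datasets' package as a matrix
dataset_index <- data(package = "datasets")$results

# Each row carries an item name and a short title
head(dataset_index[, c("Item", "Title")])

# Check that iris is among the listed items
"iris" %in% dataset_index[, "Item"]
```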
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
First some information about the dataset.
dim(iris)
## [1] 150 5
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Here is a quick look at the data graphically. It demonstrates both the built-in plot function and the ggplot library.
Note that the ggplot shows each species in a different color, so that plot provides more information.
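The code for these two plots is not shown in this rendering. A minimal sketch of what such a comparison might look like follows; the choice of the sepal measurements, the titles, and the name p_sepal are assumptions made for illustration, not the original chunk.

```r
library(ggplot2)

# Base graphics: every point is drawn the same way
plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
     main = "Iris sepals",
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")

# ggplot2: mapping Species to color separates the three groups visually
p_sepal <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
p_sepal
```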
The next plot shows regions of each species. Thanks to Yu Yang Liu for this code on Kaggle. https://www.kaggle.com/c34klh123/iris-data-with-ggplot-shiny/code
The object convexHull is a function that computes the regions. It is called once per species when generating the iris2 data frame.
Note: The “marginal points” are NOT in the region. They are just outside the region.
library(ggplot2)
# Keep only the rows of df that lie on the convex hull of the sepal measurements
convexHull <- function(df) df[chull(df$Sepal.Length, df$Sepal.Width), ]
# Apply convexHull to each species separately (requires the plyr package)
iris2 <- plyr::ddply(iris, "Species", convexHull)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(data = iris, aes(color = Species)) +
  geom_polygon(data = iris2, alpha = .3, aes(Sepal.Length, Sepal.Width, fill = Species)) +
  theme(legend.position = "bottom", plot.title = element_text(size = 15, hjust = 0.5)) +
  annotate("segment", x = 6, xend = 5.8, y = 3.75, yend = 4, arrow = arrow(), color = "black") +
  annotate("segment", x = 6.2, xend = 6.2, y = 3.65, yend = 3.4, arrow = arrow(), color = "black") +
  annotate("segment", x = 6.1, xend = 6, y = 3.65, yend = 3.4, arrow = arrow(), color = "black") +
  annotate("text", x = 6.21, y = 3.72, label = "marginal points", color = "black", size = 3)
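To see what chull does on its own, here is a small sketch on made-up points (the corners of a unit square plus its center). The toy data and variable names are illustrative and are not part of the iris analysis.

```r
# Five toy points: the corners of a unit square plus its center
x <- c(0, 1, 1, 0, 0.5)
y <- c(0, 0, 1, 1, 0.5)

# chull() returns the indices of the points that form the convex hull
hull_idx <- chull(x, y)
hull_idx

# The center point (index 5) is interior, so it is not a hull vertex
5 %in% hull_idx
```

This is why convexHull can subset a data frame with the result: the indices select exactly the rows that form the outline of each region.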
It is your turn now. You should fill in each section with R code to answer the question.
Hint: try using the dplyr library operators like group_by, summarize, and %>% (pipe).
Your code goes here:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## [1] 5.843333
## [1] 3.057333
## [1] 3.758
## [1] 1.199333
| Species | mean(Sepal.Length) | mean(Sepal.Width) | mean(Petal.Length) | mean(Petal.Width) |
|---|---|---|---|---|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |
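The code that produced these results is hidden in this rendering. One possible way to obtain the same numbers, following the dplyr hint above, is sketched here; the names overall_means and species_means are illustrative, and the original chunk may have been written differently.

```r
library(dplyr)

# Overall mean of each numeric attribute (columns 1 to 4)
overall_means <- colMeans(iris[, 1:4])
overall_means

# Mean of each attribute within each species
species_means <- iris %>%
  group_by(Species) %>%
  summarize(mean(Sepal.Length), mean(Sepal.Width),
            mean(Petal.Length), mean(Petal.Width))
species_means
```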
This should produce six graphs that look something like the following, which I created using the mtcars dataset. First, the built-in plot function, which plots mpg vs. hp:
plot(x = mtcars$hp, y = mtcars$mpg,main="Miles per gallon vs. horsepower",xlab="Horsepower",ylab = "Miles per Gallon")
Here are similar plots using ggplot, which demonstrate three side-by-side graphs with the same x and y axes:
ggplot (mtcars,aes(x=hp,y=mpg)) + geom_point() + facet_wrap (~cyl,nrow=1)
Try using ggplot to place the graphs three across and two down, with the sepal measurements in the upper row of graphs and the petal measurements in the lower row.
Put the code for your plot here.
library(ggplot2)
ggplot (iris,aes(x=Sepal.Width,y=Sepal.Length)) + geom_point() + facet_wrap(~Species,nrow=1)
ggplot (iris,aes(x=Petal.Width,y=Petal.Length)) + geom_point() + facet_wrap(~Species,nrow=1)
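The two calls above draw two separate figures. As a hedged alternative, the six panels can be placed in a single 2 x 3 figure by stacking the sepal and petal measurements into one data frame and faceting on both the flower part and the species. The data frame parts and its Part column are constructed here purely for illustration.

```r
library(ggplot2)

# Stack sepal and petal measurements into one long data frame
sepal <- data.frame(Species = iris$Species, Part = "Sepal",
                    Width = iris$Sepal.Width, Length = iris$Sepal.Length)
petal <- data.frame(Species = iris$Species, Part = "Petal",
                    Width = iris$Petal.Width, Length = iris$Petal.Length)
parts <- rbind(sepal, petal)

# Order the factor levels so that Sepal is the upper row
parts$Part <- factor(parts$Part, levels = c("Sepal", "Petal"))

# Rows are flower parts, columns are species; free scales let each row
# and column use a range suited to its own measurements
p_grid <- ggplot(parts, aes(x = Width, y = Length)) +
  geom_point() +
  facet_grid(Part ~ Species, scales = "free")
p_grid
```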
Now with one of the ggplot renderings, add a linear regression line to the plot. For example:
# (by default includes 95% confidence region)
ggplot (mtcars,aes(x=hp,y=mpg)) + geom_point() + geom_smooth(method=lm)
ggplot (iris,aes(x=Sepal.Width,y=Sepal.Length)) + geom_point() + facet_wrap(~Species,nrow=1) +
geom_smooth(method=lm)
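To inspect the numbers behind such a line, the same kind of model can be fitted directly with lm. This is a sketch on the mtcars example; the horsepower values in new_hp are arbitrary and chosen only for illustration.

```r
# Fit the straight line that geom_smooth(method = lm) draws for mtcars
fit <- lm(mpg ~ hp, data = mtcars)
coef(fit)

# 95% confidence interval for the fitted mean at two illustrative hp values;
# this corresponds to the shaded band around the regression line
new_hp <- data.frame(hp = c(100, 200))
band <- predict(fit, newdata = new_hp, interval = "confidence", level = 0.95)
band
```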
x = iris %>% gather(attribute, size, Sepal.Length:Petal.Width)
xx = x %>% group_by(Species, attribute) %>% summarize(mean(size), sd(size))
Here is a pointer to dplyr and tidyr discussion.
https://rpubs.com/bradleyboehmke/data_wrangling
Your code goes here:
#Instead of this
setosa = iris[iris$Species=="setosa",]
versicolor = iris[iris$Species=="versicolor",]
virginica = iris[iris$Species=="virginica",]
mean(setosa$Sepal.Length)
## [1] 5.006
# use this to normalize (reshape) the data into long form, one row per measurement;
# the per-species subsets above are no longer needed
library(tidyr)
library(dplyr)
x = iris %>% gather(attribute, size, Sepal.Length:Petal.Width) %>%
  group_by(Species, attribute) %>%
  summarize(mean(size), sd(size))
x
## # A tibble: 12 x 4
## # Groups: Species [?]
## Species attribute `mean(size)` `sd(size)`
## <fctr> <chr> <dbl> <dbl>
## 1 setosa Petal.Length 1.462 0.1736640
## 2 setosa Petal.Width 0.246 0.1053856
## 3 setosa Sepal.Length 5.006 0.3524897
## 4 setosa Sepal.Width 3.428 0.3790644
## 5 versicolor Petal.Length 4.260 0.4699110
## 6 versicolor Petal.Width 1.326 0.1977527
## 7 versicolor Sepal.Length 5.936 0.5161711
## 8 versicolor Sepal.Width 2.770 0.3137983
## 9 virginica Petal.Length 5.552 0.5518947
## 10 virginica Petal.Width 2.026 0.2746501
## 11 virginica Sepal.Length 6.588 0.6358796
## 12 virginica Sepal.Width 2.974 0.3224966
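In current versions of tidyr, gather still works but is superseded by pivot_longer. A sketch of the same summary written that way follows; the column names mean_size and sd_size and the object name long_summary are illustrative choices.

```r
library(tidyr)
library(dplyr)

long_summary <- iris %>%
  # One row per (flower, attribute) pair, as gather() produced above
  pivot_longer(Sepal.Length:Petal.Width,
               names_to = "attribute", values_to = "size") %>%
  group_by(Species, attribute) %>%
  # Naming the summaries avoids backquoted column names such as `mean(size)`
  summarize(mean_size = mean(size), sd_size = sd(size), .groups = "drop")
long_summary
```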
Here is an example:
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
mtcars$am[which(mtcars$am == 0)] = 'Automatic'
mtcars$am[which(mtcars$am == 1)] = 'Manual'
mtcars$am = as.factor(mtcars$am)
p = plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~am,
colors = c('#BF382A', '#0C4B8E')) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Weight'),
yaxis = list(title = 'Gross horsepower'),
zaxis = list(title = '1/4 mile time')))
p
Your code here:
library(plotly)
pSpecies = plot_ly(iris, x = ~Sepal.Length, y=~Petal.Length, z=~Sepal.Width, color=~Species)
pSpecies
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
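The messages above appear because no trace type was given. A sketch that follows the mtcars example more closely, adding add_markers and axis titles, might look like this; the axis titles and the name pSpecies3d are assumptions for illustration.

```r
library(plotly)

# Specify the marker trace explicitly and label the three axes
pSpecies3d <- plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Length, z = ~Sepal.Width,
                      color = ~Species) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Sepal length'),
                      yaxis = list(title = 'Petal length'),
                      zaxis = list(title = 'Sepal width')))
pSpecies3d
```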
If you were considering a machine learning algorithm to predict iris species based on the measured attributes, on which species do you think the algorithm will do best?
Why?
With which two species is there a possibility of errors in prediction?
Why?
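One way to explore these questions, offered as a sketch and not as the expected answer, is to compare the range of a single attribute across species and look for overlap. Petal.Length is an arbitrary choice here, and petal_ranges is an illustrative name.

```r
library(dplyr)

# Smallest and largest petal length observed for each species
petal_ranges <- iris %>%
  group_by(Species) %>%
  summarize(min_petal = min(Petal.Length), max_petal = max(Petal.Length))
petal_ranges
```

Species whose ranges overlap on an attribute cannot be told apart by that attribute alone, so it is worth repeating the comparison for the other three measurements.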
Nice web app for visualization https://yuyangliu.shinyapps.io/iris_result/
https://www.kaggle.com/c34klh123/iris-data-with-ggplot-shiny/notebook
http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
Nice 3d plotting libraries
Plotly - https://plot.ly/r/
Discussion of pipes https://www.r-bloggers.com/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/
Nice collection of ggplot graphs: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html